Combining translation and language model scoring for domain-specific data filtering

Authors

  • Saab Mansour
  • Joern Wuebker
  • Hermann Ney
Abstract

The increasing popularity of statistical machine translation (SMT) systems is introducing new domains of translation that need to be tackled. As many resources are already available, domain adaptation methods can be applied to utilize these resources in the most beneficial way for the new domain. We explore adaptation via filtering, using cross-entropy scores to discard irrelevant sentences. We focus on filtering for two important components of an SMT system, namely the language model (LM) and the translation model (TM). Previous work has already applied LM cross-entropy based scoring for filtering. We argue that LM cross-entropy might be appropriate for LM filtering, but not as much for TM filtering. We develop a novel filtering approach based on combined TM and LM cross-entropy scores. We experiment with two large-scale translation tasks, the Arabic-to-English and English-to-French IWSLT 2011 TED Talks MT tasks. For LM filtering, we achieve strong perplexity improvements, which carry over to translation quality with improvements of up to +0.4% BLEU. For TM filtering, the combined method achieves small but consistent improvements over the standalone methods. As a side effect of adaptation via filtering, the vocabulary size and phrase table size of the full SMT system are reduced by a factor of at least 2, while an improvement of up to +0.6% BLEU is observed.
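
The abstract does not give the scoring formula. A common form of LM cross-entropy filtering is the cross-entropy difference between an in-domain LM and a general-domain LM, where lower scores mark sentences that look more in-domain. The following is a minimal sketch of that idea, not the authors' implementation: the add-one smoothed unigram model stands in for a real n-gram LM, and the names `UnigramLM`, `cross_entropy_difference`, and `filter_corpus`, as well as the toy sentences, are assumptions made for illustration. How the paper combines the TM and LM scores is not stated in the abstract, so only the LM side is shown.

```python
import math
from collections import Counter


class UnigramLM:
    """Add-one smoothed unigram LM; a toy stand-in for a real n-gram LM."""

    def __init__(self, sentences, vocab):
        self.counts = Counter(w for s in sentences for w in s.split())
        self.total = sum(self.counts.values())
        # One extra slot reserves probability mass for unseen words.
        self.vocab_size = len(vocab) + 1

    def cross_entropy(self, sentence):
        """Per-word cross-entropy in bits: -(1/N) * sum(log2 p(w))."""
        words = sentence.split()
        if not words:
            return 0.0
        log_prob = sum(
            math.log2((self.counts[w] + 1) / (self.total + self.vocab_size))
            for w in words
        )
        return -log_prob / len(words)


def cross_entropy_difference(sentence, in_lm, gen_lm):
    """H_in(s) - H_gen(s); lower means the sentence looks more in-domain."""
    return in_lm.cross_entropy(sentence) - gen_lm.cross_entropy(sentence)


def filter_corpus(pool, in_lm, gen_lm, keep):
    """Return the `keep` sentences from `pool` with the lowest scores."""
    ranked = sorted(pool, key=lambda s: cross_entropy_difference(s, in_lm, gen_lm))
    return ranked[:keep]


# Hypothetical toy data: a small in-domain set and a general-domain pool.
in_domain = ["the talk was about ideas", "ideas worth spreading in a talk"]
general = [
    "the stock market fell sharply",
    "the talk inspired new ideas",
    "shares and bonds fell",
]
vocab = {w for s in in_domain + general for w in s.split()}

in_lm = UnigramLM(in_domain, vocab)
gen_lm = UnigramLM(general, vocab)
selected = filter_corpus(general, in_lm, gen_lm, keep=1)
print(selected)
```

In practice the `keep` threshold would be tuned on held-out in-domain data, for example by choosing the cutoff that minimizes perplexity.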

Similar resources

Translation Strategies of Culture-Specific Items from English to Persian in Translation of "Othello"

This study investigated the translation strategies of culture-specific items in translation of 'Othello' by William Shakespeare into Persian by Abdolhossein Nooshin. First, the English culture-specific items and their corresponding translations were identified. Then, the frequency of the strategies used by the translator according to Newmark's translation model and Venuti's domestication and fo...

Cultural Elements in the Translation of Children's Literature: Persian translation of Roald Dahl’s Matilda in focus

Translation can have long-term effects on all languages and cultures. It is not a mere linguistic act, but mostly a cultural act, since language is by nature one of the major carriers of cultural elements. Thus, the translator’s job is not just transferring the meaning of words and sentences from the source text to the target text. Culture-specific items often cause translation problems. Identi...

Cultural Frame and Translation of Pronominal Adverbs in Legal English

This paper explores the relationship between cultural knowledge and the specific meaning of a pronominal adverb in legal English where Chinese translators need to get the correct translation in their venture into translating the language of law. On the one hand, relying on the relevant legal cultural knowledge functioning as domain-general reference within a community or jurisdiction, tra...

Domain Adaptation for Medical Text Translation using Web Resources

This paper describes adapting statistical machine translation (SMT) systems to medical domain using in-domain and general-domain data as well as webcrawled in-domain resources. In order to complement the limited in-domain corpora, we apply domain focused webcrawling approaches to acquire indomain monolingual data and bilingual lexicon from the Internet. The collected data is used for adapting t...

Publication date: 2011